Development and validation of machine learning-based models for predicting healthcare-associated bacterial/fungal infections among COVID-19 inpatients: a retrospective cohort study

Background: COVID-19 and bacterial/fungal coinfections have posed significant challenges to human health. However, reliable tools for predicting coinfection risk to aid clinical work are lacking.
Objective: We aimed to investigate the risk factors for bacterial/fungal coinfection among COVID-19 patients and to develop machine learning models to estimate the risk of coinfection.
Methods: In this retrospective cohort study, we enrolled adult inpatients with confirmed COVID-19 in a tertiary hospital in China between January 1 and July 31, 2023, and collected baseline information at admission. All the data were randomly divided into a training set and a testing set at a ratio of 7:3. We developed generalized linear and random forest models for coinfections in the training set and assessed the performance of the models in the testing set. Decision curve analysis was performed to evaluate the clinical applicability.
Results: A total of 1244 patients were included in the training cohort with 62 healthcare-associated bacterial/fungal infections, while 534 were included in the testing cohort with 22 infections. We found that patients with comorbidities (diabetes, neurological disease) were at greater risk for coinfections than were those without comorbidities (OR = 2.78, 95%CI = 1.61–4.86; OR = 1.93, 95%CI = 1.11–3.35). An indwelling central venous catheter or urinary catheter was also associated with an increased risk (OR = 2.53, 95%CI = 1.39–4.64; OR = 2.28, 95%CI = 1.24–4.27) of coinfections. Patients with PCT > 0.5 ng/ml were 2.03 times (95%CI = 1.41–3.82) more likely to be infected. Interestingly, the risk of coinfection was also greater in patients with an IL-6 concentration < 10 pg/ml (OR = 1.69, 95%CI = 0.97–2.94). Patients with low baseline creatinine levels had a decreased risk of bacterial/fungal coinfections (OR = 0.40, 95%CI = 0.22–0.71). The generalized linear and random forest models demonstrated favorable areas under the receiver operating characteristic curve (AUCROC = 0.87, 95%CI = 0.80–0.94; AUCROC = 0.88, 95%CI = 0.82–0.93), with accuracy, sensitivity, and specificity of 0.86 vs. 0.75, 0.82 vs. 0.86, and 0.87 vs. 0.74, respectively. The corresponding calibration evaluation P statistics were 0.883 and 0.769.
Conclusions: Our machine learning models achieved strong predictive ability and may be effective clinical decision-support tools for identifying COVID-19 patients at risk for bacterial/fungal coinfection and guiding antibiotic administration. The levels of cytokines, such as IL-6, may affect the status of bacterial/fungal coinfection.

The long-term impacts of viral and bacterial/fungal coinfections on antimicrobial resistance are a serious public health problem [9]. It is difficult for clinicians to identify coinfections early because coinfections and COVID-19 present with similar symptoms and signs, which leads to a high rate of inappropriate antibiotic prescription [10][11][12]. Early empiric antibiotic use varied from 27 to 84% across different hospitals [10]. Two multicenter cohort studies [10,11] showed that the proportions of bacterial coinfection were lower than 10%, while the proportions of early empirical antibiotic use were as high as 60%. In the absence of bacterial coinfection, antibiotics do not benefit patients, and their overuse accelerates the development of antimicrobial resistance.
Recent studies [9,11,17] have used statistical models to estimate the risk of healthcare-associated bacterial coinfections in COVID-19 patients rather than only identifying risk factors. Estimating the probability that an individual will develop a healthcare-associated infection could support earlier intervention, such as prescribing antibiotics or providing appropriate patient care. Therefore, establishing accurate predictive models has practical significance for clinical work: it helps identify high-risk patients and target infection prevention and control measures.
As machine learning (ML) is already used for disease diagnosis and prognosis prediction, it is feasible to apply it to identify patients at high risk of bacterial coinfections [9,11,17]. Compared with traditional statistical models, machine learning models benefit from faster processors and more flexible algorithms [18,19]. Rapid progress in machine learning has provided opportunities for improved patient healthcare [20]. In this retrospective cohort study, we investigated the risk factors and established different ML models to predict the risk of healthcare-associated bacterial/fungal coinfections among inpatients with COVID-19.

Inclusion and exclusion criteria
Inpatients who tested positive for COVID-19 by nasopharyngeal swab PCR between January 1 and July 31, 2023 in a tertiary hospital in China were included. This hospital serves a population of more than nine million people and provides tertiary referral services to the surrounding regions. The exclusion criteria were as follows: (1) age under 18 years, (2) a hospital stay of less than three days, and (3) repeat admissions of a patient already included.

Definitions
According to the CDC/NHSN surveillance definition, healthcare-associated infections, also known as hospital-acquired infections, occur while a patient is receiving health care in a healthcare facility or hospital, are usually acquired ≥ 48 h after admission, and are neither present nor incubating on admission [21][22][23][24][25][26].
Healthcare-associated bacterial/fungal coinfections among COVID-19 inpatients: signs of bacterial or fungal infection that develop 48 h or more after admission in a COVID-19 inpatient and are confirmed by positive cultures are considered healthcare-associated bacterial/fungal coinfections. Our study excludes community-acquired infections [8].
Neurological diseases refer to disorders affecting the brain, spinal cord, and nerves throughout the body, including Parkinson's disease, Alzheimer's disease, multiple sclerosis, stroke, epilepsy, migraines, neuralgia, and various types of brain and spinal cord injuries.

Study design and data collection
We have a real-time healthcare-associated infection surveillance system to monitor infections closely. Inpatients' clinical information is recorded in this system, where clinicians and infection prevention and control professionals (IPCs) receive early warnings of possible infection, such as fever (> 38 ℃), elevated inflammatory markers (WBC or neutrophil count, PCT, IL-6, CRP), chest CT showing inflammation, antibiotic use or escalation of antibiotic use, and positive cultures. Microbiological isolation is mandatory to confirm a bacterial/fungal infection. According to the patient's symptoms and signs, clinicians collect specimens from suspected infection sites for etiological culture, such as blood, urine, bronchoalveolar lavage (BAL), sputum, pleural fluid, ascites, and other specimens. Clinicians diagnose and report healthcare-associated bacterial/fungal infections to the surveillance system. Meanwhile, IPCs review the medical records to verify the presence or absence of infection. In summary, whether a healthcare-associated bacterial/fungal infection has occurred is determined from the patient's symptoms and signs and a culture-positive result from the suspected infection site. By combining the real-time surveillance system with microbiological culture, we can identify healthcare-associated bacterial/fungal infections as completely as possible.
In this retrospective, single-center cohort study, data including demographic information, comorbidity information and laboratory results at admission were collected directly from the surveillance system. All predictive factors in our study were measured before the outcome occurred, rather than at an arbitrary point during the hospital stay. We also collected information on treatments given before the infections occurred, such as operation history, invasive ventilation, urinary catheter, meprednisone, dexamethasone, and tocilizumab.
Continuous variables are reported as medians and interquartile ranges (IQRs) and were compared using the Kruskal-Wallis test. Categorical variables are reported as counts and percentages and were compared using the chi-square test or Fisher's exact test. We conducted univariate and stepwise multivariate logistic regression analyses to investigate risk factors for healthcare-associated (HA) bacterial/fungal infection. Factors with a P-value less than 0.05 were considered independently associated with HA infections. Adjusted odds ratios (AORs) and 95% confidence intervals (95%CIs) were estimated.
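To make the odds ratio and confidence interval notation concrete, the following minimal Python sketch computes an unadjusted odds ratio with a Wald 95% CI from a single 2x2 table. It is a simplification: the study reports adjusted odds ratios from stepwise logistic regression, which this sketch does not reproduce. The function name and the counts are hypothetical and are not taken from the study tables.

```python
import math

def odds_ratio_wald_ci(a, b, c, d, z=1.96):
    """Unadjusted odds ratio and Wald CI from a 2x2 table.

    a: exposed with infection      b: exposed without infection
    c: unexposed with infection    d: unexposed without infection
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive")
    log_or = math.log((a * d) / (b * c))
    # Standard error of the log odds ratio (Woolf's method).
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return (
        math.exp(log_or),
        math.exp(log_or - z * se),
        math.exp(log_or + z * se),
    )

# Hypothetical counts for illustration only (not the study data).
or_, lo, hi = odds_ratio_wald_ci(20, 180, 42, 1002)
print(f"OR = {or_:.2f}, 95%CI = {lo:.2f}-{hi:.2f}")
```

An interval that excludes 1 corresponds to a two-sided Wald P-value below 0.05, which is the significance criterion stated above.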

Model development and internal validation
We randomly divided all the samples into a training set and a testing set at a ratio of 7:3. The training set was used to screen variables and develop models, while the testing set was used for model evaluation. For model development, we selected the variables that were statistically significant in our univariate analysis. The models included 14 candidate predictors, as follows: diabetes, kidney disease, neurological disease, ICU admission, PCT_level, albumin (ALB_level), creatinine (Cr_level), IL-6_level, CRP_level, neutrophil percent (Ne_level), central venous catheter (CVC), urinary catheter (UC), invasive ventilation (IV), and dexamethasone (DXM). The variance inflation factors (VIFs) were calculated to assess the multicollinearity of the predictors. As all the predictors had a VIF less than 2, indicating no meaningful multicollinearity, all were included in model development.
A random forest model was established (ntree = 500, mtry = 4) and the importance of the variables was determined. Our study compared the discrimination of the models by the area under the receiver operating characteristic curve (AUCROC). The calibration slopes were calculated to check the risk of overfitting. Decision curve analyses were performed to evaluate whether the risk models improved clinical decision-making [27].
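The random 7:3 division can be sketched in a few lines of Python. This is an illustration under assumptions: the seed, the placeholder patient ids, and the function name are hypothetical, and the original analysis software and sampling procedure are not described in the text.

```python
import random

def split_train_test(ids, train_fraction=0.7, seed=2023):
    """Shuffle record ids reproducibly and cut them into two sets."""
    shuffled = list(ids)
    # A dedicated Random instance keeps the split reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]

# Placeholder ids standing in for the 1778 eligible inpatients.
patient_ids = range(1778)
train_ids, test_ids = split_train_test(patient_ids)
print(len(train_ids), len(test_ids))
```

Truncating 0.7 × 1778 gives 1244 training records and leaves 534 for testing. A split that also preserves the infection rate in both sets (stratified sampling) would be a reasonable variant, but the text does not say whether stratification was used.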
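As an illustration of the discrimination measure, the AUCROC can be computed directly as the proportion of case/non-case pairs in which the case receives the higher predicted risk. The sketch below is a plain-Python version written for clarity rather than speed; the function name and the toy data are hypothetical and do not come from the study.

```python
def auc_roc(y_true, y_score):
    """AUC as the probability that a random case outranks a random non-case.

    Ties in the predicted score count as half a concordant pair.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one non-case")
    concordant = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                concordant += 1.0
            elif p == n:
                concordant += 0.5
    return concordant / (len(pos) * len(neg))

# Toy outcomes (1 = infection) and predicted risks, for illustration only.
y_toy = [0, 0, 1, 0, 1, 1, 0, 1]
p_toy = [0.02, 0.10, 0.35, 0.40, 0.55, 0.08, 0.05, 0.70]
print(auc_roc(y_toy, p_toy))
```

A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of infected and non-infected patients.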

Baseline characteristics
A total of 1946 inpatients were diagnosed with laboratory-confirmed COVID-19 between January 1 and July 31, 2023. As shown in Fig. 1, 1778 eligible inpatients were enrolled in this study. The median age of the patients was 69 years (interquartile range (IQR), 56-80 years), and 1043 were male (58.66%). Table 1 shows the differences in baseline characteristics between the HA infection group and the non-HA infection group. Eighty-four (4.72%) patients developed healthcare-associated bacterial/fungal infections, 75 of whom had bacterial infections and 9 of whom had fungal infections. The most common bacterial species isolated was Klebsiella pneumoniae, which was found in 18 patients, and the main infection site was the lower respiratory tract.
After random sampling, the training set comprised 1244 patients with 62 HA infections, while the testing set comprised 534 patients with 22 HA infections. There was no significant difference in the HA infection rate between the two sets (P = 0.51).

Generalized linear model
The result of the ANOVA test (P = 0.66) indicated no significant difference between the full and stepwise models, and the AIC of the stepwise model (417.22) was lower than that of the full model (426.17). Thus, the stepwise logistic regression model was chosen as the final generalized linear model (GLM) and included 7 predictors, as shown in Table 3.
As shown in Table 3, compared with patients without diabetes, patients with diabetes had a 2.

Random forest model
The random forest model (RFM) was trained using 1244 inpatients and 14 variables. It yielded an out-of-bag error of 4.98%. As shown in Fig. 2, variable importance was ranked by the mean decrease in Gini; by this criterion, neurological disease, diabetes, IL-6 levels and dexamethasone made the greatest contributions.
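The mean decrease in Gini builds on the impurity reduction achieved by individual tree splits. The Python sketch below shows that reduction for one binary split, assuming the usual two-class Gini impurity; a random forest implementation accumulates such reductions per variable over all splits and trees. The function names and node counts are hypothetical.

```python
def gini(n_pos, n_neg):
    """Gini impurity of a node holding n_pos cases and n_neg non-cases."""
    n = n_pos + n_neg
    if n == 0:
        return 0.0
    p = n_pos / n
    return 2 * p * (1 - p)

def gini_decrease(parent, left, right):
    """Impurity decrease of one binary split.

    Each argument is a (n_pos, n_neg) tuple; the children must
    partition the parent node.
    """
    if (left[0] + right[0], left[1] + right[1]) != tuple(parent):
        raise ValueError("children must add up to the parent node")
    n = sum(parent)
    weighted_children = (
        sum(left) * gini(*left) + sum(right) * gini(*right)
    ) / n
    return gini(*parent) - weighted_children

# Hypothetical node counts (cases, non-cases), for illustration only.
decrease = gini_decrease((62, 1182), (30, 170), (32, 1012))
print(decrease)
```

A split that separates cases from non-cases more cleanly yields a larger decrease, so variables that are repeatedly chosen for such splits rank higher in importance.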

Discrimination
The two models achieved comparable performance, as shown in Fig. 3. The AUCROCs for the GLM and RFM were 0.87 (95%CI = 0.80-0.94) and 0.88 (95%CI = 0.82-0.93), respectively. The RFM slightly outperformed the GLM. The sensitivities of both models were greater than 80%.
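Sensitivity, specificity, and accuracy all follow from the four cells of a classification matrix at a chosen risk cut-off. The Python sketch below shows the arithmetic. The counts are hypothetical: they were chosen only to add up to a testing set of 534 patients with 22 infections and are not the entries of Table 4.

```python
def classification_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity and accuracy from confusion-matrix counts."""
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one case and one non-case")
    total = tp + fp + tn + fn
    return {
        "sensitivity": tp / (tp + fn),   # cases correctly flagged
        "specificity": tn / (tn + fp),   # non-cases correctly cleared
        "accuracy": (tp + tn) / total,   # all correct classifications
    }

# Hypothetical counts for illustration only (not taken from Table 4).
metrics = classification_metrics(tp=18, fp=69, tn=443, fn=4)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Because infections are rare in this setting, accuracy is dominated by specificity, which is one reason to report all three measures together with the AUCROC.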

Calibration
As shown in Fig. 4, the calibration lines were close to the ideal lines, and slopes close to 1 indicated no overfitting. A Dxy above 0.7 indicated good correlation between the predicted and actual values and was higher for the RFM than for the GLM (0.824 vs. 0.734). The mean squared errors (Brier scores) of the GLM and RFM were 0.032 and 0.028, respectively; smaller values are better. The S:p statistic is the P value of the Spiegelhalter Z test, and values > 0.05 indicated good fit for both models. These indicators were similar in the two models, but the calibration of the RFM was slightly better than that of the GLM.
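The Brier score is the mean squared difference between the predicted probability and the observed 0/1 outcome. The Python sketch below computes it and, as a reference point, evaluates a prediction that assigns every patient the overall infection rate. It assumes the scores are evaluated on the testing set (22 infections among 534 patients); the function and variable names are hypothetical.

```python
def brier_score(y_true, y_prob):
    """Mean squared error between predicted risks and 0/1 outcomes."""
    if len(y_true) != len(y_prob) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

# Outcome vector with the testing-set composition: 22 cases, 512 non-cases.
y_test = [1] * 22 + [0] * 512
prevalence = sum(y_test) / len(y_test)

# Reference: every patient receives the same risk, the overall prevalence.
reference = brier_score(y_test, [prevalence] * len(y_test))
print(f"Brier score of the prevalence-only reference: {reference:.4f}")
```

Under these assumptions the prevalence-only reference scores about 0.04, so Brier scores below that value indicate that a model adds information beyond the baseline infection rate.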

Decision curve
As shown in Fig. 5, both models had greater standardized net benefits than the default strategies (intervening on all patients or on none) across the threshold range. Thus, both models offered better utility in supporting clinical decisions than the default strategies.
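The quantity plotted in a decision curve can be sketched as follows. Net benefit at a risk threshold weighs true positives against false positives using the odds of the threshold, and the standardized version divides by the prevalence. This is a minimal Python illustration of the standard formulas, not the software used in the study; the function names and toy data are hypothetical.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of intervening on patients with risk >= threshold."""
    if not 0 < threshold < 1:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    # False positives are penalized by the odds of the threshold.
    return tp / n - (fp / n) * threshold / (1 - threshold)

def net_benefit_treat_all(y_true, threshold):
    """Net benefit of the default strategy of intervening on everyone."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)

def standardized_net_benefit(y_true, y_prob, threshold):
    """Net benefit divided by prevalence, so that 1 is the maximum."""
    prevalence = sum(y_true) / len(y_true)
    return net_benefit(y_true, y_prob, threshold) / prevalence

# Toy data for illustration only; intervening on no one has net benefit 0.
y_toy = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
p_toy = [0.60, 0.05, 0.10, 0.30, 0.02, 0.25, 0.04, 0.03, 0.08, 0.01]
for t in (0.05, 0.10, 0.20):
    print(t, net_benefit(y_toy, p_toy, t), net_benefit_treat_all(y_toy, t))
```

A model is useful at a given threshold when its net benefit exceeds both default strategies, which is the comparison the decision curves display.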

Discussion
Bacterial/fungal coinfection is a serious complication of COVID-19, especially in the presence of comorbidities, and can lead to a worse prognosis and antibiotic overuse [28]. In the present study, of a total of 1778 patients hospitalized with COVID-19, approximately 5% presented with bacterial/fungal coinfections. We investigated the risk factors associated with bacterial/fungal infections and developed machine learning-based models with robust predictive performance. The models showed that comorbidities (diabetes, neurological diseases), invasive procedures (central venous catheter, urinary catheter), baseline inflammatory marker levels (IL-6, PCT), and creatinine were associated with the risk of bacterial/fungal infection. These predictors are inexpensive, quick, and easy to obtain from electronic medical records. Machine learning-based models are promising methods for infection surveillance and disease prognosis and make it easier to identify high-risk inpatients. When the estimated coinfection risk is low, it is recommended to limit antibiotics or use them cautiously, whereas high-risk estimates suggest enhancing surveillance or reallocating resources through additional patient care or enhanced disinfection, which could improve the efficiency of hospital infection surveillance [29]. Early detection of high-risk patients is beneficial for preventing hospital infection outbreaks, antibiotic overuse, and microbial resistance.
Diabetes is related to various infections, especially skin, lower respiratory tract, and urinary tract infections [30]. A review suggested that diabetes and its comorbidities may lead to some infectious diseases due to metabolic disturbances [30]. Similarly, Suheda Erener [31] summarized clinical data showing that diabetes and neurological disease may render patients more vulnerable to infectious diseases. In line with the findings of previous studies [2,12], multivariate logistic analysis indicated that central venous and urinary catheters are associated with increased infection risk. The most common source of catheter-related infection is contamination of the intradermal tract and the catheter interface by organisms from the patient's skin or from healthcare workers' hands. Patients with catheters often have severe disease and lower immunity, which makes it harder for them to defend against bacterial invasion. In our study, these factors were entered as predictors during model development and yielded promising risk estimates.
PCT is a well-known biomarker of bacterial infection and is involved in the early recognition of bacterial coinfection in patients with influenza pneumonia. Several studies have noted that high PCT levels on admission are associated with severe outcomes in critically ill patients [28,32]. We found that PCT > 0.5 ng/ml was associated with an increased coinfection risk and had significant predictive value for bacterial/fungal coinfection among COVID-19 patients. Similarly, a study reported that a PCT cut-off value of 0.55 ng/mL on admission may help identify bacterial coinfections [33]. However, a meta-analysis concluded that PCT has limited predictive value for bacterial coinfections, although lower PCT levels might indicate a decreased risk [34]. Although the value of PCT in predicting bacterial coinfection remains controversial, a continuous increase in PCT levels may indicate bacterial coinfection and progression toward more severe complications [35][36][37]. Nonetheless, clinicians could consider withholding antibiotics in patients with a PCT level lower than 0.5 ng/ml, which could be a helpful decision-support rule to guide antibiotic therapy for COVID-19 [33,38,39].
IL-6 is a prototypical cytokine with pleiotropic activity that contributes to maintaining homeostasis [40]. Previous reports have shown that an acute infection response induces rapid production of IL-6, which activates the host defense mechanism against infection through elevated acute-phase proteins and the immune response [40,41]. In our study, an IL-6 level lower than 10 pg/mL may indicate bacterial/fungal coinfection, likely due to immunosuppression or corticosteroid therapy in the hospital. If IL-6 production is deficient during the acute infection response phase, the host might not be able to defend against secondary infections. However, excessive IL-6 levels and uncontrolled IL-6 receptor signaling are common in critically ill patients [42]. Cytokine storm, the excessive synthesis of cytokines, can worsen the patient's clinical condition [43]. By monitoring IL-6 levels, healthcare professionals may be able to identify potential coinfections and provide appropriate treatment, ultimately improving patient outcomes. Future studies could explore cytokine levels and their changes at different phases of bacterial/fungal coinfection and their impact on prognosis among COVID-19 patients.
Creatinine is a biomarker of kidney function. Several studies evaluating the association between biomarkers of kidney disease and death in COVID-19 patients found that patients with increased creatinine or a low glomerular filtration rate at baseline had a poor prognosis [44,45]. Our study showed that patients with low creatinine levels at baseline had a decreased risk of bacterial/fungal coinfections, possibly because acute kidney injury had not yet occurred. However, the relationship between kidney disease and post-acute COVID-19 syndrome is not yet determined, and prospective studies need to measure more laboratory biomarkers, such as glomerular filtration rate and urinary β2-microglobulin, to assess kidney function [46].
In summary, these factors are valuable for predicting and assessing the risk of bacterial/fungal coinfections. Incorporating them into our models not only supports informed decisions but also helps us take proactive measures to prevent such infections.
Recent studies have begun to develop prediction models to identify bacterial coinfections among COVID-19 patients. A study [11] in Italy calculated a predictive risk score by assigning a point value according to the β coefficient to classify patients at risk of bacterial coinfection. This intuitive approach may be useful in diagnostic testing and antibiotic use. Machine learning (ML) algorithms are novel and rapidly evolving technologies that provide opportunities for clinical decision support in healthcare [11]. Rawson et al. [9] demonstrated that a support vector machine (SVM) with 21 blood test variables can accurately predict positive microbiological samples. However, that study focused only on comparing algorithm performance and piloting the algorithm in a small group of patients admitted to the hospital. Ferentzakis et al. [47] applied five ML techniques to explore the association rules in antimicrobial resistance profiles in the ICU. They forecast the antimicrobial resistance of Acinetobacter baumannii, Klebsiella pneumoniae, and Pseudomonas aeruginosa, which could be a low-cost decision-support tool for selecting the appropriate empirical antibiotic treatment [48]. Another study [29] developed ML models for the surveillance of surgical site infections (SSIs) and demonstrated that ML could improve the efficiency of SSI surveillance by decreasing the burden of chart review while maintaining high sensitivity.
Discrimination is a traditional performance metric in model evaluation that uses the AUCROC or C statistic to compare models. In our study, the AUCROCs of the two models exceeded 0.85, indicating excellent discrimination: the models differentiated high-risk groups well from those at lower risk. However, discrimination alone is insufficient to assess the performance of predictive models, and calibration or goodness of fit is often regarded as the most reliable property of a model [49]. Few studies have drawn calibration curves to evaluate the agreement between predicted and actual probabilities [20]. Our calibration lines were close to the ideal calibration line. Both slopes were approximately equal to 1, and the intercepts were equal to 0, indicating no overfitting and no systematic overestimation or underestimation by our models. The Dxy, which reflects the correlation between the predicted and actual values, was higher for the RFM than for the GLM (0.824 vs. 0.734). The mean squared errors (Brier scores) of the GLM and RFM were 0.032 and 0.028, respectively; smaller values are better. Thus, the calibration of the RFM was slightly better than that of the GLM. The decision curves showed that these models had greater standardized net benefits than the default strategies across all risk thresholds, which indicated that early management of high-risk patients according to our models could be beneficial [20]. In summary, multiple measures should be combined to evaluate the strengths and weaknesses of models.
Our study has several limitations. First, we may have underestimated the prevalence of bacterial/fungal infections. Generally, clinicians and IPCs diagnose and report healthcare-associated infection cases, and the number of cases detected partly relies on the extent of their efforts and the sensitivity of surveillance. Some infections might have been missed because of the low culture-positive rate of certain specimens, such as blood and cerebrospinal fluid samples. Second, some indicators that may be associated with the prognosis of COVID-19, such as heart failure, cirrhosis, chronic kidney disease (CKD), glomerular filtration rate (GFR), ferritin, and suPAR levels, were not included as candidate predictors because of the retrospective study design. In the future, prospective and multi-center studies can directly measure more parameters to improve and externally validate the prediction models. Third, we did not test for other viral infections, although viral coinfections are also relevant to the prognosis of COVID-19 patients. Nevertheless, identifying the risk factors for bacterial/fungal coinfections and estimating the probability of coinfection could guide the rational use of antibiotics.

Conclusions
Our results indicate that the machine learning models achieved strong predictive ability and may be effective clinical decision-support tools for bacterial/fungal infection surveillance and for guiding antibiotic administration. The GLM suggested that patients with an IL-6 concentration < 10 pg/ml are more vulnerable to developing a bacterial/fungal infection.

Fig. 1 Flowchart of study participant selection and model development and validation

Fig. 5 Decision curves for the default strategies and for the GLM and RFM (testing set, n = 534)

Table 1 Demographic characteristics, comorbidities, and laboratory test results at baseline for patients with HA bacterial/fungal infections and non-HA infections

Table 2 Sampling results for the training set and testing set

Table 3 Univariate and multivariate logistic regression analyses with the stepwise method in the training set (n = 1244)

Table 4 Statistics and classification matrix of the testing set